Abstract

The goal of a learner in standard online learning is to maintain an average loss close to the loss of the best-performing single function in some class. In many real-world problems, such as rating or ranking items, there is no single best target function during the runtime of the algorithm; instead, the best (local) target function drifts over time. We develop a novel last-step min-max optimal algorithm in the context of drift. We analyze the algorithm in the worst-case regret framework and show that it maintains an average loss close to that of the best slowly changing sequence of linear functions, as long as the total drift is sublinear. In some situations, our bound improves over existing bounds, and additionally the algorithm suffers logarithmic regret when there is no drift. We also build on the $H_\infty$ filter and its bound, and develop and analyze a second algorithm for the drifting setting. Synthetic simulations demonstrate the advantages of our algorithms in a worst-case constant drift setting.

1 Introduction 

We consider online learning problems in which a learning algorithm predicts real numbers given inputs in a sequence of trials. An example of such a problem is to predict a stock's prices given input about the current state of the stock market. In general, the goal of the algorithm is to achieve an average loss that is not much larger than the loss it would suffer had it always predicted according to the best-performing single function from some class of functions.

In the past half a century, many algorithms were proposed for this problem (a review can be found in a comprehensive book on the topic [10]), some of which are able to achieve an average loss arbitrarily close to that of the best function in retrospect. Furthermore, such guarantees hold even if the input and output pairs are chosen in a fully adversarial manner with no distributional assumptions.

Competing with the best fixed function might not suffice for some problems. In many real-world applications, the true target function is not fixed, but is slowly drifting over time. Consider a function designed to rate movies for recommender systems given some features. Over time the rating of a movie may change as more movies are released or the season changes. Furthermore, the personal taste of a user may change as well.

With such properties in mind, we develop new learning algorithms designed to work with target drift. The goal of an algorithm is to maintain an average loss close to that of the best slowly changing sequence of functions, rather than compete well with a single function. We focus on problems for which this sequence consists only of linear functions. Some previous algorithms [7, 22, 25] designed for this problem are based on gradient descent, with additional control on the norm (or Bregman divergence) of the weight-vector used for prediction [25], or the number of inputs used to define it [7].

We take a different route and derive an algorithm based on the last-step min-max approach proposed by Forster [17] and later used [34] for online density estimation. On each iteration the algorithm makes the optimal min-max prediction with respect to a quantity called regret, assuming it is the last iteration. Yet, unlike previous work, it is optimal when a drift is allowed. As opposed to the derivation of the last-step min-max predictor for a fixed vector, the resulting optimization problem is not straightforward to solve. We develop a dynamic program (a recursion) to solve this problem, which allows us to compute the optimal last-step min-max predictor. We analyze the algorithm in the worst-case regret framework and show that the algorithm maintains an average loss close to that of the best slowly changing sequence of functions, as long as the total drift is sublinear in the number of rounds $T$. Specifically, we show that if the total amount of drift is $T\nu$ (for $\nu = o(1)$) the cumulative regret is bounded by $T\nu^{1/3} + \log(T)$. When the instantaneous drift is close to constant, this improves over a previous bound of Vaits and Crammer [35] for an algorithm named ARCOR, for which they showed a bound of $T\nu^{1/4}\log(T)$. Additionally, when no drift is introduced (stationary setting) our algorithm suffers logarithmic regret, as for the algorithm of Forster [17]. We also build on the $H_\infty$ adaptive filter, which is min-max optimal with respect to a filtering task, and derive another learning algorithm based on the same min-max principle. We provide a regret bound for this algorithm as well, and relate the two algorithms and their respective bounds. Finally, synthetic simulations show the advantages of our algorithms when a close to constant drift is allowed.

2 Problem Setting 

We focus on the regression task evaluated with the squared loss. Our algorithms are designed for the online setting and work in iterations (or rounds). On each round an online algorithm receives an input-vector $x_t \in \mathbb{R}^d$ and predicts a real value $\hat{y}_t \in \mathbb{R}$. Then the algorithm receives a target label $y_t \in \mathbb{R}$ associated with $x_t$, uses it to update its prediction rule, and then proceeds to the next round.

On each round, the performance of the algorithm is evaluated using the squared loss, $\ell_t(\mathrm{alg}) = \ell(y_t, \hat{y}_t) = (\hat{y}_t - y_t)^2$. The cumulative loss suffered over $T$ iterations is $L_T(\mathrm{alg}) = \sum_{t=1}^{T} \ell_t(\mathrm{alg})$. The goal of the algorithm is to have low cumulative loss compared to predictors from some class. A large body of work is focused on linear prediction functions of the form $f(x) = x^\top u$ where $u \in \mathbb{R}^d$ is some weight-vector. We denote by $\ell_t(u) = (x_t^\top u - y_t)^2$ the instantaneous loss of a weight-vector $u$.

We focus on algorithms that are able to compete against sequences of weight-vectors, $(u_1, \dots, u_T) \in \mathbb{R}^d \times \dots \times \mathbb{R}^d$, where $u_t$ is used to make a prediction for the $t$th example $(x_t, y_t)$. We define the cumulative loss of such a set by $L_T(\{u_t\}) = \sum_t \ell_t(u_t)$ and the regret of an algorithm by $R_T(\{u_t\}) = \sum_t (y_t - \hat{y}_t)^2 - L_T(\{u_t\})$. The goal of the algorithm is to have low regret, formally $R_T(\{u_t\}) = o(T)$; that is, the average loss suffered by the algorithm converges to the average loss of the best linear function sequence $(u_1, \dots, u_T)$.

Clearly, with no restriction or penalty over the set $\{u_t\}$ the right term of the regret can easily be zero by setting $u_t = x_t (y_t / \|x_t\|^2)$, which implies $\ell_t(u_t) = 0$ for all $t$. Thus, in the analysis below we incorporate the total drift of the weight-vectors, defined to be
\[ V = V_T(\{u_t\}) = \sum_{t=1}^{T-1} \|u_t - u_{t+1}\|^2 , \qquad \nu = \nu(\{u_t\}) = \frac{V}{T} , \tag{1} \]
where $\nu$ is the average drift. Below we bound the regret with $R_T(\{u_t\}) \le O\left(T^{2/3} V^{1/3} + \log(T)\right) = O\left(T \nu^{1/3} + \log(T)\right)$. Next, we develop an explicit form of the last-step min-max algorithm with drift.
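The quantities defined in this section are simple to compute for a given comparator sequence. The following is a minimal NumPy sketch, not part of the original work; the function names and the random test data are illustrative assumptions. It also shows numerically why the drift penalty is needed: the interpolating sequence $u_t = x_t (y_t / \|x_t\|^2)$ attains zero loss at the price of a large total drift.

```python
import numpy as np

def cumulative_loss(X, y, U):
    """L_T({u_t}) = sum_t (x_t^T u_t - y_t)^2; row t of U is the comparator u_t."""
    preds = np.einsum("td,td->t", X, U)
    return float(np.sum((preds - y) ** 2))

def total_drift(U):
    """V = sum_{t=1}^{T-1} ||u_t - u_{t+1}||^2, as in Eq. (1)."""
    diffs = U[1:] - U[:-1]
    return float(np.sum(diffs ** 2))

def regret(y_hat, X, y, U):
    """R_T({u_t}) = sum_t (y_t - y_hat_t)^2 - L_T({u_t})."""
    return float(np.sum((y - y_hat) ** 2)) - cumulative_loss(X, y, U)

# Illustrative data (an assumption, not taken from the experiments of Sec. 6).
rng = np.random.default_rng(0)
T, d = 200, 5
X = rng.normal(size=(T, d))
y = rng.normal(size=T)

# Unconstrained comparator u_t = x_t * y_t / ||x_t||^2 interpolates every label.
U_interp = X * (y / np.sum(X ** 2, axis=1))[:, None]
zero_loss = cumulative_loss(X, y, U_interp)
V = total_drift(U_interp)
nu = V / T  # average drift
```

Here `zero_loss` is zero up to rounding, while `V` grows with `T`, so such a sequence is heavily penalized once the drift enters the bound.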

3 Algorithm 

We define the last-step minmax predictor $\hat{y}_T$ to be$^1$
\[ \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \min_{u_1, \dots, u_T} Q_T(u_1, \dots, u_T) \right] , \tag{2} \]
where we define
\[ Q_t(u_1, \dots, u_t) = b \|u_1\|^2 + c \sum_{s=1}^{t-1} \|u_{s+1} - u_s\|^2 + \sum_{s=1}^{t} (y_s - u_s^\top x_s)^2 , \tag{3} \]
for some positive constants $b, c$. The last optimization problem can also be seen as a game where the algorithm chooses a prediction $\hat{y}_t$ to minimize the last-step regret, while an adversary chooses a target label $y_t$ to maximize it. The first term of (2) is the loss suffered by the algorithm, while $Q_t(u_1, \dots, u_t)$ defined in (3) is a sum of the loss suffered by some sequence of linear functions $(u_1, \dots, u_t)$, a penalty for consecutive pairs that are far from each other, and a penalty for the norm of the first vector being far from zero.

We first solve recursively the inner optimization problem $\min_{u_1, \dots, u_t} Q_t(u_1, \dots, u_t)$, for which we define an auxiliary function,
\[ P_t(u_t) = \min_{u_1, \dots, u_{t-1}} Q_t(u_1, \dots, u_t) , \tag{4} \]
which clearly satisfies
\[ \min_{u_1, \dots, u_t} Q_t(u_1, \dots, u_t) = \min_{u_t} P_t(u_t) . \tag{5} \]

We start the derivation of the algorithm with a lemma stating a recursive form of the function-sequence $P_t(u_t)$.

Lemma 1. For $t = 2, 3, \dots$
\[ P_1(u_1) = Q_1(u_1) , \]
\[ P_t(u_t) = \min_{u_{t-1}} \left( P_{t-1}(u_{t-1}) + c \|u_t - u_{t-1}\|^2 + (y_t - u_t^\top x_t)^2 \right) . \]

$^1$ $\hat{y}_T$ and $y_T$ serve both as quantifiers (over the min and max operators, respectively), and as the optimal arguments of this optimization problem.

The proof appears in App. B.1. Using Lem. 1 we write explicitly the function $P_t(u_t)$.

Lemma 2. The following equality holds
\[ P_t(u_t) = u_t^\top D_t u_t - 2 u_t^\top e_t + f_t , \tag{6} \]
where
\[ D_1 = bI + x_1 x_1^\top , \qquad D_t = \left( D_{t-1}^{-1} + c^{-1} I \right)^{-1} + x_t x_t^\top , \tag{7} \]
\[ e_1 = y_1 x_1 , \qquad e_t = \left( I + c^{-1} D_{t-1} \right)^{-1} e_{t-1} + y_t x_t , \tag{8} \]
\[ f_1 = y_1^2 , \qquad f_t = f_{t-1} - e_{t-1}^\top \left( cI + D_{t-1} \right)^{-1} e_{t-1} + y_t^2 . \tag{9} \]

Note that $D_t \in \mathbb{R}^{d \times d}$ is a positive definite matrix and $e_t \in \mathbb{R}^{d}$.

The proof appears in App. B.2. From Lem. 2 we conclude, by substituting (6) in (5), that
\[ \min_{u_1, \dots, u_t} Q_t(u_1, \dots, u_t) = \min_{u_t} \left( u_t^\top D_t u_t - 2 u_t^\top e_t + f_t \right) = - e_t^\top D_t^{-1} e_t + f_t . \tag{10} \]
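The recursion (7)-(9) can be checked numerically against a direct minimization of $Q_T$ in (3). The sketch below is an illustration under assumptions: the function names are hypothetical and the direct solver simply stacks $(u_1, \dots, u_T)$ into one vector and solves the resulting block-tridiagonal linear system. It is meant as a sanity check of (10), not as an efficient implementation.

```python
import numpy as np

def recursion_min(X, y, b, c):
    """Run (7)-(9) and return min Q_T = f_T - e_T^T D_T^{-1} e_T, as in (10)."""
    T, d = X.shape
    I = np.eye(d)
    D = b * I + np.outer(X[0], X[0])   # D_1
    e = y[0] * X[0]                    # e_1
    f = y[0] ** 2                      # f_1
    for t in range(1, T):
        # All three updates use D_{t-1} and e_{t-1}, so D is updated last.
        f = f - e @ np.linalg.solve(c * I + D, e) + y[t] ** 2
        e = np.linalg.solve(I + D / c, e) + y[t] * X[t]
        D = np.linalg.inv(np.linalg.inv(D) + I / c) + np.outer(X[t], X[t])
    return float(f - e @ np.linalg.solve(D, e))

def direct_min(X, y, b, c):
    """Minimize Q_T over the stacked vector (u_1, ..., u_T) in closed form."""
    T, d = X.shape
    I = np.eye(d)
    A = np.zeros((T * d, T * d))       # Q_T(u) = u^T A u - 2 u^T g + sum_t y_t^2
    g = np.zeros(T * d)
    for t in range(T):
        s = slice(t * d, (t + 1) * d)
        A[s, s] += np.outer(X[t], X[t])
        g[s] = y[t] * X[t]
        if t + 1 < T:                  # drift penalty c * ||u_{t+1} - u_t||^2
            n = slice((t + 1) * d, (t + 2) * d)
            A[s, s] += c * I
            A[n, n] += c * I
            A[s, n] -= c * I
            A[n, s] -= c * I
    A[:d, :d] += b * I                 # penalty b * ||u_1||^2
    return float(y @ y - g @ np.linalg.solve(A, g))
```

On small random inputs the two functions agree up to numerical error, which is what Lem. 2 predicts; the recursion needs only $d \times d$ matrices, whereas the direct solve works with a $Td \times Td$ system.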

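Substituting (10) back in (2) we get that the last-step min-max predictor $\hat{y}_T$ is given by
\[ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 + e_T^\top D_T^{-1} e_T - f_T \right] . \tag{11} \]

Since $e_T$ depends on $y_T$ we substitute (8) in the second term of (11),
\[ e_T^\top D_T^{-1} e_T = \left( \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} + y_T x_T \right)^\top D_T^{-1} \left( \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} + y_T x_T \right) . \tag{12} \]

Substituting (12) and (9) in (11) and omitting terms not depending explicitly on $y_T$ and $\hat{y}_T$ we get
\[ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ (y_T - \hat{y}_T)^2 + y_T^2 \, x_T^\top D_T^{-1} x_T + 2 y_T \, x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} - y_T^2 \right] \]
\[ \phantom{\hat{y}_T} = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \left( x_T^\top D_T^{-1} x_T \right) y_T^2 + \hat{y}_T^2 + 2 y_T \left( x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} - \hat{y}_T \right) \right] . \tag{13} \]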

The last equation is strictly convex in $y_T$ and thus the optimal solution is not bounded. To solve it, we follow an approach used by Forster in a different context [17]. In order to make the optimal value bounded, we assume that the adversary can only choose labels from a bounded set $y_T \in [-Y, Y]$. Since (13) is convex in $y_T$, its maximum over this interval is attained at an endpoint, $y_T \in \{+Y, -Y\}$, and the optimal solution of (13) over $y_T$ yields
\[ \hat{y}_T = \arg\min_{\hat{y}_T} \left[ \left( x_T^\top D_T^{-1} x_T \right) Y^2 + \hat{y}_T^2 + 2 Y \left| x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} - \hat{y}_T \right| \right] . \]

This problem is of a similar form to the one discussed by Forster [17], from which we get the optimal solution $\hat{y}_T = \mathrm{clip}\left( x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1}, Y \right)$, where for $y > 0$ we define $\mathrm{clip}(x, y) = \mathrm{sign}(x) \min\{|x|, y\}$. The optimal solution depends explicitly on the bound $Y$, and as its value is not known, we ignore it and define the output of the algorithm to be
\[ \hat{y}_T = x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} . \tag{14} \]



We call the algorithm LASER, for last step adaptive regressor, and it is summarized in Fig. 1. Clearly, for $c = \infty$ the LASER algorithm reduces to the AAR algorithm of Vovk [36], or the last-step min-max algorithm of Forster [17]. See also the work of Azoury and Warmuth [2]. The algorithm can be combined with Mercer kernels as it employs only sums of inner- and outer-products of its inputs. This algorithm can also be seen as a forward algorithm [2]: the predictor of (14) can be seen as the optimal linear model obtained over the same prefix of length $T - 1$ and the new input $x_T$ with fictional label $y_T = 0$. Specifically, from (8) we get that if $y_T = 0$, then $e_T = \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1}$. The prediction of the optimal predictor defined in (10) is $x_T^\top u_T = x_T^\top D_T^{-1} e_T = \hat{y}_T$, where $\hat{y}_T$ was defined in (14).
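The updates (7), (8) and the prediction rule (14) translate into a few lines of NumPy. The sketch below is one possible rendering under assumptions: the function name `laser_predictions` is hypothetical, and the matrices are inverted explicitly at $O(d^3)$ cost per round for clarity rather than efficiency.

```python
import numpy as np

def laser_predictions(X, y, b, c):
    """Run LASER on rows of X with labels y; return the predictions of Eq. (14)."""
    if not 0 < b < c:
        raise ValueError("LASER requires 0 < b < c")
    T, d = X.shape
    I = np.eye(d)
    D = (b * c / (c - b)) * I    # D_0, chosen so that (7) gives D_1 = b I + x_1 x_1^T
    e = np.zeros(d)              # e_0
    y_hat = np.empty(T)
    for t in range(T):
        x = X[t]
        shrunk = np.linalg.solve(I + D / c, e)                        # (I + D_{t-1}/c)^{-1} e_{t-1}
        D = np.linalg.inv(np.linalg.inv(D) + I / c) + np.outer(x, x)  # Eq. (7)
        y_hat[t] = x @ np.linalg.solve(D, shrunk)                     # Eq. (14)
        e = shrunk + y[t] * x                                         # Eq. (8), after y_t is revealed
    return y_hat
```

For very large `c` the predictions coincide numerically with the forward predictor $x_t^\top (bI + \sum_{s \le t} x_s x_s^\top)^{-1} \sum_{s < t} y_s x_s$, in line with the remark that LASER reduces to the algorithms of Vovk and Forster when $c = \infty$.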

4 Analysis 

We now analyze the performance of the algorithm in the 
worst-case setting, starting with the following technical 
lemma. 

Lemma 3. For all $t$ the following statement holds,
\[ D'_{t-1} D_t^{-1} x_t x_t^\top D_t^{-1} D'_{t-1} - D_{t-1}^{-1} + D'_{t-1} \left( D_t^{-1} D'_{t-1} + c^{-1} I \right) \preceq 0 , \]
where $D'_{t-1} = \left( I + c^{-1} D_{t-1} \right)^{-1}$.

The proof appears in App. B.3. We next bound the cumulative loss of the algorithm.

Theorem 4. Assume the labels are bounded, $\sup_t |y_t| \le Y$ for some $Y \in \mathbb{R}$. Then the following bound holds,
\[ L_T(\mathrm{LASER}) \le \min_{u_1, \dots, u_T} \left[ b \|u_1\|^2 + c V_T(\{u_t\}) + L_T(\{u_t\}) \right] + Y^2 \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t . \]

Proof. Fix $t$. A long algebraic manipulation yields,
\[ (y_t - \hat{y}_t)^2 + \min_{u_1, \dots, u_{t-1}} Q_{t-1}(u_1, \dots, u_{t-1}) - \min_{u_1, \dots, u_t} Q_t(u_1, \dots, u_t) \]
\[ = (y_t - \hat{y}_t)^2 + 2 y_t \, x_t^\top D_t^{-1} D'_{t-1} e_{t-1} + e_{t-1}^\top \left[ - D_{t-1}^{-1} + D'_{t-1} \left( D_t^{-1} D'_{t-1} + c^{-1} I \right) \right] e_{t-1} + y_t^2 \, x_t^\top D_t^{-1} x_t - y_t^2 . \tag{15} \]

Substituting the specific value of the predictor $\hat{y}_t = x_t^\top D_t^{-1} D'_{t-1} e_{t-1}$ from (14), we get that (15) equals
\[ \hat{y}_t^2 + y_t^2 \, x_t^\top D_t^{-1} x_t + e_{t-1}^\top \left[ - D_{t-1}^{-1} + D'_{t-1} \left( D_t^{-1} D'_{t-1} + c^{-1} I \right) \right] e_{t-1} \]
\[ = e_{t-1}^\top \left[ D'_{t-1} D_t^{-1} x_t x_t^\top D_t^{-1} D'_{t-1} - D_{t-1}^{-1} + D'_{t-1} \left( D_t^{-1} D'_{t-1} + c^{-1} I \right) \right] e_{t-1} + y_t^2 \, x_t^\top D_t^{-1} x_t . \tag{16} \]

Using Lem. 3 we upper bound (16) with $y_t^2 \, x_t^\top D_t^{-1} x_t \le Y^2 x_t^\top D_t^{-1} x_t$. Finally, summing over $t \in \{1, \dots, T\}$ gives the desired bound,
\[ L_T(\mathrm{LASER}) - \min_{u_1, \dots, u_T} \left[ b \|u_1\|^2 + c V_T(\{u_t\}) + L_T(\{u_t\}) \right] = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \min_{u_1, \dots, u_T} Q_T(u_1, \dots, u_T) \le Y^2 \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t . \]
□

Parameters: $0 < b < c$
Initialize: Set $D_0 = \frac{bc}{c - b} I \in \mathbb{R}^{d \times d}$ and $e_0 = 0 \in \mathbb{R}^d$
For $t = 1, \dots, T$ do
• Receive an instance $x_t$
• Compute $D_t = \left( D_{t-1}^{-1} + c^{-1} I \right)^{-1} + x_t x_t^\top$ (7)
• Output prediction $\hat{y}_t = x_t^\top D_t^{-1} \left( I + c^{-1} D_{t-1} \right)^{-1} e_{t-1}$
• Receive the correct label $y_t$
• Update: $e_t = \left( I + c^{-1} D_{t-1} \right)^{-1} e_{t-1} + y_t x_t$ (8)
Output: $e_T$, $D_T$

Figure 1: LASER: last step adaptive regression algorithm.

Parameters: $1 < a$, $0 < b, c$
Initialize: Set $P_0 = b^{-1} I \in \mathbb{R}^{d \times d}$ and $w_0 = 0 \in \mathbb{R}^d$
For $t = 1, \dots, T$ do
• Receive an instance $x_t$
• Output prediction $\hat{y}_t = x_t^\top w_{t-1}$
• Receive the correct label $y_t$
• Compute $\tilde{P}_t = \left( P_{t-1}^{-1} + (a - 1) x_t x_t^\top \right)^{-1}$
• Update $w_t = w_{t-1} + a \tilde{P}_t (y_t - \hat{y}_t) x_t$
• Update $P_t = \tilde{P}_t + c^{-1} I$
Output: $w_T$, $P_T$

Figure 2: An $H_\infty$ algorithm for online regression.



In the next lemma we further bound the right term of Thm. 4. This type of bound is based on the usage of the covariance-like matrix $D$.

Lemma 5.
\[ \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t \le \ln \left| \frac{1}{b} D_T \right| + c^{-1} \sum_{t=1}^{T} \mathrm{Tr}(D_{t-1}) . \tag{17} \]

Proof. Similar to the derivation of Forster [17] (details omitted due to lack of space),
\[ x_t^\top D_t^{-1} x_t \le \ln \frac{|D_t|}{|D_t - x_t x_t^\top|} = \ln \frac{|D_t|}{\left| \left( D_{t-1}^{-1} + c^{-1} I \right)^{-1} \right|} = \ln \left( \frac{|D_t|}{|D_{t-1}|} \left| I + c^{-1} D_{t-1} \right| \right) = \ln \frac{|D_t|}{|D_{t-1}|} + \ln \left| I + c^{-1} D_{t-1} \right| , \]
and because $\ln \left| \frac{1}{b} D_0 \right| \ge 0$ we get
\[ \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t \le \ln \left| \frac{1}{b} D_T \right| + \sum_{t=1}^{T} \ln \left| I + c^{-1} D_{t-1} \right| \le \ln \left| \frac{1}{b} D_T \right| + c^{-1} \sum_{t=1}^{T} \mathrm{Tr}(D_{t-1}) . \]
□

At first sight it seems that the right term of (17) may grow super-linearly with $T$, as each of the matrices $D_t$ grows with $t$. The next two lemmas show that this is not the case, and in fact, the right term of (17) does not grow too fast, which will allow us to obtain a sub-linear regret bound. Lem. 6 analyzes the properties of the recursion of $D$ defined in (7) for scalars, that is $d = 1$. In Lem. 7 we extend this analysis to matrices.

Lemma 6. Define $f(\lambda) = \lambda \beta / (\lambda + \beta) + x^2$ for $\beta, \lambda > 0$ and some $x^2 \le \gamma^2$. Then: (1) $f(\lambda) \le \beta + \gamma^2$; (2) $f(\lambda) \le \lambda + \gamma^2$; (3) $f(\lambda) \le \max\left\{ \lambda, \frac{3\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2 \beta}}{2} \right\}$.

The proof appears in App. B.6. We build on Lem. 6 to bound the maximal eigenvalue of the matrices $D_t$.

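Lemma 7. Assume $\|x_t\|^2 \le X^2$ for some $X$. Then, the eigenvalues of $D_t$ (for $t \ge 1$), denoted by $\lambda_i(D_t)$, are upper bounded by
\[ \max_i \lambda_i(D_t) \le \max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2} , \; b + X^2 \right\} . \]

Proof. By induction. From (7) we have that $\lambda_i(D_1) \le b + X^2$ for $i = 1, \dots, d$. We proceed with a proof for some $t$. For simplicity, denote by $\lambda_i = \lambda_i(D_{t-1})$ the $i$th eigenvalue of $D_{t-1}$ with a corresponding eigenvector $v_i$. From (7) we have,
\[ D_t = \left( D_{t-1}^{-1} + c^{-1} I \right)^{-1} + x_t x_t^\top \preceq \left( D_{t-1}^{-1} + c^{-1} I \right)^{-1} + I \|x_t\|^2 = \sum_{i} v_i v_i^\top \left( \frac{\lambda_i c}{\lambda_i + c} + \|x_t\|^2 \right) . \tag{18} \]

Plugging Lem. 6 in (18) we get,
\[ D_t \preceq \sum_{i} v_i v_i^\top \max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2} , \; b + X^2 \right\} = \max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2} , \; b + X^2 \right\} I . \]
□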
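The claim of Lem. 7 is easy to probe numerically: iterating (7) on inputs of bounded norm, the largest eigenvalue of $D_t$ should stay below the stated bound for every $t$ rather than grow with $t$. The following NumPy sketch is an illustration only; the function names and the random inputs are assumptions.

```python
import numpy as np

def lemma7_bound(X_sq, b, c):
    """max{ (3 X^2 + sqrt(X^4 + 4 X^2 c)) / 2, b + X^2 }, with X_sq = X^2."""
    return max((3 * X_sq + np.sqrt(X_sq ** 2 + 4 * X_sq * c)) / 2, b + X_sq)

def max_eigs(X, b, c):
    """Largest eigenvalue of D_t for t = 1..T under the recursion (7)."""
    T, d = X.shape
    I = np.eye(d)
    D = (b * c / (c - b)) * I          # D_0 as in Fig. 1
    out = []
    for x in X:
        D = np.linalg.inv(np.linalg.inv(D) + I / c) + np.outer(x, x)
        out.append(np.linalg.eigvalsh(D)[-1])   # eigvalsh returns ascending eigenvalues
    return np.array(out)

# Illustrative inputs with ||x_t|| = 2, so X^2 = 4 (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X = 2.0 * X / np.linalg.norm(X, axis=1, keepdims=True)
lam = max_eigs(X, b=0.5, c=10.0)
bound = lemma7_bound(4.0, b=0.5, c=10.0)
```

In this run every entry of `lam` stays below `bound`, in contrast with the stationary case $c = \infty$, where $D_t = bI + \sum_s x_s x_s^\top$ has eigenvalues that grow with $t$.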



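Finally, equipped with the above lemmas we prove the main result of this section.

Corollary 8. Assume $\|x_t\|^2 \le X^2$, $|y_t| \le Y$. Then,
\[ L_T(\mathrm{LASER}) \le b \|u_1\|^2 + L_T(\{u_t\}) + Y^2 \ln \left| \frac{1}{b} D_T \right| + c^{-1} Y^2 \mathrm{Tr}(D_0) + cV + c^{-1} Y^2 T d \max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2} , \; b + X^2 \right\} . \tag{19} \]

Furthermore, set $b = \varepsilon c$ for some $0 < \varepsilon < 1$. Denote by $\mu = \max\left\{ \frac{9}{8} X^2 , \frac{(b + X^2)^2}{8 X^2} \right\}$ and $M = \max\{3X^2, b + X^2\}$. If $V \le T \frac{\sqrt{2} Y^2 d X}{\mu^{3/2}}$ (low drift) then by setting
\[ c = \left( \sqrt{2} T Y^2 d X / V \right)^{2/3} \tag{20} \]
we have,
\[ L_T(\mathrm{LASER}) \le b \|u_1\|^2 + 3 \left( \sqrt{2} Y^2 d X \right)^{2/3} T^{2/3} V^{1/3} + \frac{\varepsilon}{1 - \varepsilon} Y^2 d + L_T(\{u_t\}) + Y^2 \ln \left| \frac{1}{b} D_T \right| . \tag{21} \]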
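The choice of $c$ in (20) balances the two drift-dependent terms of (19). Using the bound $\max\{\cdot\} \le 2X\sqrt{2c}$ from the proof in Sec. A.1, these terms are at most $cV + 2\sqrt{2}\, Y^2 T d X c^{-1/2}$, and (20) is the minimizer of this expression. The short Python sketch below checks the arithmetic; the function names and the numeric values are illustrative assumptions.

```python
import math

def low_drift_c(T, V, Y, d, X):
    """The value of c from Eq. (20)."""
    return (math.sqrt(2) * T * Y ** 2 * d * X / V) ** (2 / 3)

def drift_terms(c, T, V, Y, d, X):
    """c V + c^{-1} Y^2 T d * 2 X sqrt(2 c): the c-dependent part of the bound."""
    return c * V + (Y ** 2 * T * d / c) * 2 * X * math.sqrt(2 * c)

def closed_form(T, V, Y, d, X):
    """3 (sqrt(2) Y^2 d X)^{2/3} T^{2/3} V^{1/3}, the drift term of Eq. (21)."""
    return 3 * (math.sqrt(2) * Y ** 2 * d * X) ** (2 / 3) * T ** (2 / 3) * V ** (1 / 3)

# Hypothetical problem sizes, chosen only to exercise the formulas.
T, V, Y, d, X = 1000, 5.0, 1.0, 3, 2.0
c_star = low_drift_c(T, V, Y, d, X)
at_optimum = drift_terms(c_star, T, V, Y, d, X)
```

Here `at_optimum` matches `closed_form(T, V, Y, d, X)`, and moving `c` away from `c_star` in either direction increases `drift_terms`. The sketch does not check the low-drift condition of the corollary, which is needed for the bound on the maximum to apply.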



The proof appears in Sec. A.1. A few remarks are in order. First, when the total drift $V$ goes to zero, we set $c = \infty$ and thus we have $D_t = bI + \sum_{s=1}^{t} x_s x_s^\top$, as used in recent algorithms [36, 17, 9]. In this case the algorithm reduces to the algorithm by Forster [17] (which is also the Aggregating Algorithm for Regression of Vovk [36]), with the same logarithmic regret bound (note that the last term of (21) is logarithmic in $T$, see the proof of Forster [17]). See also the work of Azoury and Warmuth [2]. Second, substituting $V = T\nu$ we get that the bound depends on the average drift as $T^{2/3} (T\nu)^{1/3} = T \nu^{1/3}$. Clearly, to have a sublinear regret we must have $\nu = o(1)$. Third, Vaits and Crammer [35] recently proposed an algorithm, called ARCOR, for the same setting. The regret of ARCOR depends on the total drift as $\sqrt{T V'} \log(T)$, where their definition of total drift is a sum of the Euclidean differences $V' = \sum_{t=1}^{T-1} \|u_{t+1} - u_t\|$, rather than the squared norm. When the instantaneous drift $\|u_{t+1} - u_t\|$ is constant, this notion of total drift is related to our average drift, $V' = T \sqrt{\nu}$. Therefore, in this case the bound of ARCOR [35] is $\nu^{1/4} T \log(T)$, which is worse than our bound, both since it has an additional $\log(T)$ factor (as opposed to our additive log term) and since $\nu = o(1)$. Therefore we expect that our algorithm will perform better than ARCOR [35] when the instantaneous drift is approximately constant. Indeed, the synthetic simulations described in Sec. 6 further support this conclusion. Fourth, Herbster and Warmuth [22] developed shifting bounds for general gradient descent algorithms with projection of the weight-vector using the Bregman divergence. In their bounds, there is a factor greater than 1 multiplying the term $L_T(\{u_t\})$, leading to a small regret only when the data is close to being realizable with linear models. Yet, their bounds have better dependency on $d$, the dimension of the inputs $x$. Busuttil and Kalnishkan [6] developed a variant of the Aggregating Algorithm [20] for the non-stationary setting. However, to have sublinear regret they require a strong assumption on the drift, $V = o(1)$, while we require only $V = o(T)$. Fifth, if $V \ge T \frac{Y^2 d M}{\mu^2}$ then by setting $c = \sqrt{Y^2 d M T / V}$ we have,
\[ L_T(\mathrm{LASER}) \le b \|u_1\|^2 + 2 \sqrt{Y^2 d T M V} + \frac{\varepsilon}{1 - \varepsilon} Y^2 d + L_T(\{u_t\}) + Y^2 \ln \left| \frac{1}{b} D_T \right| \tag{22} \]
(see App. B.5 for details). The last bound is linear in $T$ and can be obtained also by a naive algorithm that outputs $\hat{y}_t = 0$ for all $t$.

5 An $H_\infty$ Algorithm for Online Regression

Adaptive filtering is an active and well established area of research in signal processing. Formally, it is equivalent to online learning. On each iteration $t$ the filter receives an input $x_t \in \mathbb{R}^d$ and predicts a corresponding output $\hat{y}_t$. It then receives the true desired output $y_t$ and updates its internal model. Many adaptive filtering algorithms employ linear models, that is, at time $t$ they output $\hat{y}_t = w_t^\top x_t$. For example, a well known online learning algorithm [37] for regression, which is basically a gradient-descent algorithm with the squared loss, is known as the least mean-square (LMS) algorithm in the adaptive filtering literature.

One possible difference between adaptive filtering and online learning lies in the interpretation of algorithms and, as a consequence, in their analysis. In online learning, the goal of an algorithm is to make predictions $\hat{y}_t$, and the predictions are compared to the predictions of some function from a known class (e.g. linear, parameterized by $u$). Thus, a typical online performance bound relates the quality of the algorithm's predictions with the quality of some function's $g(x) = u^\top x$ predictions, using some non-negative loss measure $\ell(w_t^\top x_t, y_t)$. Such bounds often have the following shape,
\[ \underbrace{\sum_t \ell(w_t^\top x_t, y_t)}_{\text{algorithm loss with respect to observation}} \le A \underbrace{\sum_t \ell(u^\top x_t, y_t)}_{\text{function } u \text{ loss}} + B , \]
for some multiplicative factor $A$ and an additive factor $B$.

Adaptive filtering is similar to the realizable setting in machine learning, where it is assumed that some filter exists and the goal is to recover it using noisy observations. Often it is assumed that the output is a corrupted version of the output of some function, $y = f(x) + n$, with some noise $n$. Thus a typical bound relates the quality of an algorithm's predictions with respect to the target filter $u$ and the amount of noise in the problem,
\[ \underbrace{\sum_t \ell(w_t^\top x_t, u^\top x_t)}_{\text{algorithm loss with respect to a reference}} \le A \underbrace{\sum_t \ell(u^\top x_t, y_t)}_{\text{amount of noise}} + B . \]

The $H_\infty$ filters (see e.g. papers by Simon [33, 32]) are a family of (robust) linear filters developed based on a min-max approach, like LASER, and analyzed in the worst case setting. These filters are reminiscent of the celebrated Kalman filter [23], which was motivated and analyzed in a stochastic setting with Gaussian noise. Pseudocode of one such filter, which we modified for online linear regression, appears in Fig. 2. The theory of $H_\infty$ filters states [33, Section 11.3] the following bound on its performance as a filter.



Theorem 9. Assume the filter is executed with parameters $a > 1$ and $b, c > 0$. Then, for all input-output pairs $(x_t, y_t)$ and for all reference vectors $u_t$ the following bound holds on the filter's performance, $\sum_{t=1}^{T} \left( x_t^\top w_t - x_t^\top u_t \right)^2 \le a L_T(\{u_t\}) + b \|u_1\|^2 + c V_T(\{u_t\})$.

From the theorem we establish a regret bound for the $H_\infty$ algorithm applied to online learning.

Corollary 10. Fix $\alpha > 0$. The total squared loss suffered by the algorithm is bounded by
\[ L_T(H_\infty) \le \left( 1 + 1/\alpha + (1 + \alpha) a \right) L_T(\{u_t\}) + (1 + \alpha) b \|u_1\|^2 + (1 + \alpha) c V_T(\{u_t\}) . \tag{23} \]

Proof. Using a bound of Hassibi and Kailath (Lemma 4 of their work) we have that for all $\alpha > 0$, $\left( y_t - x_t^\top w_t \right)^2 \le \left( 1 + \frac{1}{\alpha} \right) \left( y_t - x_t^\top u_t \right)^2 + (1 + \alpha) \left[ x_t^\top (w_t - u_t) \right]^2$. Plugging back into the theorem and collecting the terms we get the desired bound. □



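The bound holds for any $\alpha > 0$. We plug $\alpha = \sqrt{L_T(\{u_t\}) / \left( a L_T(\{u_t\}) + cV + b \|u_1\|^2 \right)}$ in (23) to get,
\[ L_T(H_\infty) \le (1 + a) L_T(\{u_t\}) + cV + b \|u_1\|^2 + 2 \sqrt{\left( a L_T(\{u_t\}) + cV + b \|u_1\|^2 \right) L_T(\{u_t\})} \]
\[ \le \left( 1 + a + 2\sqrt{a} \right) L_T(\{u_t\}) + cV + b \|u_1\|^2 + 2 \sqrt{\left( cV + b \|u_1\|^2 \right) L_T(\{u_t\})} . \]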
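The substitution above is a one-dimensional minimization of the right-hand side of (23) over $\alpha$. Writing $L = L_T(\{u_t\})$ and $B = cV + b\|u_1\|^2$, the bound equals $(1 + a)L + B + L/\alpha + \alpha(aL + B)$, which is minimized at $\alpha = \sqrt{L/(aL + B)}$. The Python sketch below checks both displayed inequalities numerically; the function names and the numeric values are assumptions made for illustration.

```python
import math

def cor10_bound(alpha, a, L, B):
    """Right-hand side of Eq. (23), with B = b ||u_1||^2 + c V."""
    return (1 + 1 / alpha + (1 + alpha) * a) * L + (1 + alpha) * B

def best_alpha(a, L, B):
    """Minimizer of cor10_bound over alpha > 0."""
    return math.sqrt(L / (a * L + B))

# Hypothetical values: a > 1, comparator loss L and penalty B.
a, L, B = 2.0, 50.0, 7.0
alpha_star = best_alpha(a, L, B)
tight = cor10_bound(alpha_star, a, L, B)
first_line = (1 + a) * L + B + 2 * math.sqrt((a * L + B) * L)
second_line = (1 + a + 2 * math.sqrt(a)) * L + B + 2 * math.sqrt(B * L)
```

Here `tight` coincides with `first_line`, and `second_line` is larger because $\sqrt{p + q} \le \sqrt{p} + \sqrt{q}$ was used to separate the term $aL$ from $B$ under the square root.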



Intuitively, we expect the $H_\infty$ algorithm to perform better when the data is close to linear, that is when $L_T(\{u_t\})$ is small, as, conceptually, it was designed to minimize a loss with respect to weights $\{u_t\}$. On the other hand, LASER is expected to perform better when the data is hard to predict with linear models, as it is not motivated from this assumption. Indeed, the bounds reflect these observations.

Comparing the last bound with (21) we note a few differences. First, the factor $(1 + a + 2\sqrt{a}) > 4$ of $L_T(\{u_t\})$ is worse for $H_\infty$ than for LASER, for which it is one. Second, LASER has a worse dependency on the drift, $T^{2/3} V^{1/3}$, while for $H_\infty$ it is about $cV + 2\sqrt{cV L_T(\{u_t\})}$. Third, $H_\infty$ has an additive factor $\sim \sqrt{L_T(\{u_t\})}$, while LASER has an additive logarithmic factor, at most.

Hence, the bound of the $H_\infty$-based algorithm is better when the cumulative loss $L_T(\{u_t\})$ is small. In this case, $4 L_T(\{u_t\})$ is not a large quantity, and as all the other quantities behave like $\sqrt{L_T(\{u_t\})}$, they are small as well. On the other hand, if $L_T(\{u_t\})$ is large, and is linear in $T$, the first term of the bound becomes dominant, and thus the factor of 4 for the $H_\infty$ algorithm makes its bound higher than that of LASER. Both bounds were obtained from a min-max approach, either directly (LASER) or via reduction from filtering ($H_\infty$). The bound of the former is lower in hard problems. Kivinen et al. [26] proposed another approach for filtering with a bound depending on $\sum_t \|u_t - u_{t-1}\|$ and not the sum of squares as we have both for LASER and the $H_\infty$-based algorithm.

Figure 3: Cumulative squared loss for AROWR, ARCOR, NLMS, CR-RLS, LASER and $H_\infty$ vs iteration. Top left - linear drift and linear data, top right - sublinear drift and linear data, bottom left - linear drift and noisy data, bottom right - sublinear drift and noisy data.

6 Simulations 

We evaluate the LASER and $H_\infty$ algorithms on four synthetic datasets. We set $T = 2000$ and $d = 20$. For all datasets, the inputs $x_t \in \mathbb{R}^{20}$ were generated such that the first ten coordinates were grouped into five groups of size two. Each such pair was drawn from a 45° rotated Gaussian distribution with standard deviations 10 and 1. The remaining 10 coordinates were drawn from independent Gaussian distributions $\mathcal{N}(0, 2)$. The first synthetic dataset was generated using a sequence of vectors $u_t \in \mathbb{R}^{20}$ for which the only non-zero coordinates are the first two, where their values are the coordinates of a unit vector that is rotating at a constant rate (linear drift). Specifically, we have $\|u_t\| = 1$ and the instantaneous drift $\|u_t - u_{t-1}\|$ is constant. The second synthetic dataset was generated using a sequence of vectors $u_t \in \mathbb{R}^{20}$ for which the only non-zero coordinates are the first two. This vector in $\mathbb{R}^2$ is of unit norm, $\|u_t\| = 1$, and rotating at a rate of $t^{-1}$ (sublinear drift). In addition, every 50 time-steps the two-dimensional vector defined above was "embedded" in a different pair of coordinates of the reference vector $u_t$: for the first 50 steps these were coordinates 1, 2; in the next 50 examples, coordinates 3, 4; and so on. This change causes a switch in the reference vector $u_t$. For the first two datasets we set $y_t = x_t^\top u_t$ (linear data). The third and fourth datasets are the same as the first and second, except that we set $y_t = x_t^\top u_t + n_t$ where $n_t \sim \mathcal{N}(0, 0.05)$ (noisy data).
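A generator for the first and third datasets (constant-rate rotation) could look as follows. This is a sketch under stated assumptions, not the code used for the reported experiments: the rotation step of 0.01 radians per round is a hypothetical value, since the rate is not specified in the text, and the second parameter of $\mathcal{N}(\cdot, \cdot)$ is read as a variance. All function names are illustrative.

```python
import numpy as np

def make_inputs(T, rng):
    """Inputs in R^20: five 45-degree-rotated Gaussian pairs plus ten independent coordinates."""
    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    pairs = rng.normal(size=(T, 5, 2)) * np.array([10.0, 1.0])  # standard deviations 10 and 1
    pairs = pairs @ R.T                                          # rotate every pair by 45 degrees
    rest = rng.normal(scale=np.sqrt(2.0), size=(T, 10))          # assumption: N(0, 2) means variance 2
    return np.hstack([pairs.reshape(T, 10), rest])

def rotating_targets(T, d=20, step=0.01):
    """Unit vector rotating at a constant rate in the first two coordinates."""
    angles = step * np.arange(T)       # `step` is a hypothetical rotation rate
    U = np.zeros((T, d))
    U[:, 0] = np.cos(angles)
    U[:, 1] = np.sin(angles)
    return U

def make_dataset(T=2000, noise_var=0.0, step=0.01, seed=0):
    """noise_var=0 gives linear data; noise_var=0.05 gives the noisy variant (variance assumed)."""
    rng = np.random.default_rng(seed)
    X = make_inputs(T, rng)
    U = rotating_targets(T, step=step)
    noise = rng.normal(scale=np.sqrt(noise_var), size=T)
    y = np.einsum("td,td->t", X, U) + noise
    return X, U, y
```

With a constant angular step the instantaneous drift $\|u_t - u_{t-1}\|$ is the same on every round, which is the linear-drift regime described above. The sublinear-drift datasets with coordinate switches are not covered by this sketch.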

We compared six algorithms: NLMS (normalized least mean square) [3, 5], which is a state-of-the-art first-order algorithm, AROWR (AROW for Regression) [14], ARCOR [35], CR-RLS [11, 30, 19], LASER and $H_\infty$. The algorithms' parameters were tuned using a single random sequence. We repeat each experiment 100 times, reporting the mean cumulative square loss. The results are summarized in Fig. 3 (best viewed in color).

For the first and third datasets (left plots of Fig. 3) we observe the superior performance of the LASER algorithm over previous approaches. LASER has a good tracking ability and a fast learning rate, and it is designed to perform well in severe conditions like linear drift.

For the second and fourth datasets (right plots of Fig. 3), where we have a sublinear drift level, we get that ARCOR outperforms LASER since it is especially designed for a sublinear amount of data drift; yet, $H_\infty$ outperforms ARCOR when there is no noise (top-right plot).

For the third and fourth datasets (bottom plots of Fig. 3), where we added noise to the labels, the performance of $H_\infty$ degrades, as expected from our discussion in Sec. 5.

7 Related Work 

The problem of performing online regression has been studied for more than fifty years in statistics, signal processing and machine learning. We already mentioned the work of Widrow and Hoff [37], who studied a gradient descent algorithm for the squared loss. Many variants of the algorithm have been studied since then. A notable example is the normalized least mean squares algorithm (NLMS) [3, 5] that adapts to the input's scale.

There exists a large body of work on this problem proposed by the machine learning community, which clearly cannot be covered fully here. We refer the reader to an encyclopedic book on the subject [10]. Gradient descent based algorithms for regression with the squared loss were proposed by Cesa-Bianchi et al. [8] about two decades ago. These algorithms were generalized and extended by Kivinen and Warmuth [24] using additional regularization functions.

An online version of the ridge regression algorithm in the worst-case setting was proposed and analyzed by Foster [18]. A related algorithm called the Aggregating Algorithm (AA) was studied by Vovk [20], and later applied to the problem of linear regression with square loss [36]. The recursive least squares (RLS) algorithm is a similar algorithm proposed for adaptive filtering. Both algorithms make use of second order information, as they maintain a weight-vector and a covariance-like positive semi-definite (PSD) matrix used to re-weight the input. The eigenvalues of this covariance-like matrix increase with time $t$, a property which is used to prove logarithmic regret bounds.

The derivation of our algorithm shares similarities with the work of Forster [17] and the work of Moroshko and Crammer [29]. These algorithms are motivated from the last-step min-max predictor. While the algorithms of Forster [17] and Moroshko and Crammer [29] are designed for the stationary setting, our work is primarily designed for the non-stationary setting. Moroshko and Crammer [29] also discussed a weak variant of the non-stationary setting, where the complexity is measured by the total distance from a reference vector $u$, rather than the total distance of consecutive vectors (as in this paper), which is more relevant to non-stationary problems. Note also that Moroshko and Crammer [29] did not derive algorithms for the non-stationary setting, but only showed a bound for the weighted min-max algorithm (designed for the stationary setting) in the weak non-stationary setting.

Our work is closest to a recent algorithm [35] called ARCOR. This algorithm is based on the RLS algorithm with an additional projection step, and it controls the eigenvalues of a covariance-like matrix using scheduled resets. The Covariance Reset RLS algorithm (CR-RLS) [11, 30, 19] is another example of an algorithm that resets a covariance matrix, but after every fixed number of data points, as opposed to ARCOR, which performs these resets adaptively. All of these algorithms, which were designed to have numerically stable computations, perform a covariance reset from time to time. Our algorithm, LASER, is simpler as it does not involve these steps, and it controls the increase of the eigenvalues of the covariance matrix $D$ implicitly rather than explicitly by "averaging" it with a fixed diagonal matrix (see (7)). The Kalman filter [23] and the $H_\infty$ algorithm (e.g. [33]) designed for filtering take a similar approach, yet the exact algebraic form is different (Fig. 1 vs. Fig. 2).

ARCOR also explicitly controls the norm of the weight vector, which is used for its analysis, by projecting it into a bounded set, as was also proposed by Herbster and Warmuth [22]. Other approaches to control its norm are to shrink it multiplicatively [25] or to remove old examples [7]. Some of these algorithms were designed to have sparse functions in the kernel space (e.g. [13, 15]). Note that our algorithm LASER is simpler as it does not perform any of these operations explicitly. Finally, a few algorithms that employ second order information were recently proposed for classification [9, 14, 12], and later in the online convex programming framework [16, 28].



8 Summary and Conclusions 

We proposed a novel algorithm for non-stationary online
regression, designed and analyzed with the squared loss.
The algorithm was developed from the last-step min-max
predictor for non-stationary problems, and we showed an
exact recursive form of its solution. We also described an
algorithm based on the H∞ filter, which is also motivated by
a min-max approach, yet for filtering, and bounded its
regret. Simulations showed superior performance in a
worst-case drift setting (close to a constant drift per iteration).

An interesting future direction is to extend the algorithm
to general loss functions rather than the squared loss.
Currently, implementing the algorithm requires either matrix
inversion or eigenvector decomposition; we would like to
design a more efficient version of the algorithm. Additionally,
for the algorithm to perform well, it must be given the amount
of drift V or a bound on it. An interesting direction is to
design algorithms that automatically detect the level of
drift, or are invariant to it.

A Proofs 

A.1 Proof of Corollary 8

Proof. Plugging Lem. 5 in Thm. 4 we have, for all
$(u_1, \ldots, u_T)$,

$$L_T(\mathrm{LASER}) \le b\|u_1\|^2 + cV + L_T(\{u_t\}) + Y^2 \ln\left|\frac{1}{b} D_T\right| + c^{-1} Y^2 \sum_{t=1}^{T} \mathrm{Tr}(D_{t-1}) .$$



Using Lem. 7 we bound the RHS and get

$$L_T(\mathrm{LASER}) \le b\|u_1\|^2 + L_T(\{u_t\}) + Y^2 \ln\left|\frac{1}{b} D_T\right| + c^{-1} Y^2 \mathrm{Tr}(D_0) + cV + c^{-1} Y^2 T d \max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\; b + X^2 \right\} .$$



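The term $c^{-1} Y^2 \mathrm{Tr}(D_0)$ does not depend on $T$, because
$c^{-1} Y^2 \mathrm{Tr}(D_0) = c^{-1} Y^2 d \frac{bc}{c-b} = \frac{\varepsilon}{1-\varepsilon} Y^2 d$. To show (21),
note that

$$V \le T \frac{\sqrt{2}\, Y^2 d X}{\mu^{3/2}} \;\Leftrightarrow\; \mu \le \left( \frac{\sqrt{2}\, Y^2 d X T}{V} \right)^{2/3} = c .$$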



We thus have that $\left(3X^2 + \sqrt{X^4 + 4X^2 c}\right)/2 \le \left(3X^2 + \sqrt{8X^2 c}\right)/2 \le \sqrt{8X^2 c}$,
and we get a bound on the right term of (19),

$$\max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\; b + X^2 \right\} \le \max\left\{ \sqrt{8X^2 c},\; b + X^2 \right\} \le 2X\sqrt{2c} .$$

Using this bound and plugging the value of $c$ from (20) we
bound (19) and conclude the proof,

$$\left( \frac{\sqrt{2}\, T Y^2 d X}{V} \right)^{2/3} V + Y^2 T d\, 2X \sqrt{2} \left( \frac{\sqrt{2}\, T Y^2 d X}{V} \right)^{-1/3} = 3 \left( \sqrt{2}\, T Y^2 d X \right)^{2/3} V^{1/3} .$$

□
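
The last step of the proof is pure arithmetic and can be checked numerically. The sketch below is only an illustration with hypothetical function names and arbitrary parameter values: it evaluates the two c-dependent terms, cV + c^{-1} Y^2 T d 2X sqrt(2c), at c = (sqrt(2) T Y^2 d X / V)^{2/3} and compares the result with the closed form 3 (sqrt(2) T Y^2 d X)^{2/3} V^{1/3}.

```python
import math

def drift_terms(c, T, V, X, Y, d):
    """The two c-dependent terms: c V + c^{-1} Y^2 T d * 2 X sqrt(2 c)."""
    return c * V + (Y**2 * T * d * 2 * X * math.sqrt(2 * c)) / c

def c_star(T, V, X, Y, d):
    """The value of c used in the proof: (sqrt(2) T Y^2 d X / V)^(2/3)."""
    return (math.sqrt(2) * T * Y**2 * d * X / V) ** (2.0 / 3.0)

def closed_form(T, V, X, Y, d):
    """The resulting term: 3 (sqrt(2) T Y^2 d X)^(2/3) V^(1/3)."""
    return 3 * (math.sqrt(2) * T * Y**2 * d * X) ** (2.0 / 3.0) * V ** (1.0 / 3.0)
```

Setting the derivative of cV + 2 sqrt(2) X Y^2 T d c^{-1/2} with respect to c to zero gives the same value of c, so this choice also minimizes the sum of the two terms.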
B APPENDIX

Proof. By definition, $P_1(u_1) = Q_1(u_1) = b\|u_1\|^2 + (y_1 - u_1^\top x_1)^2 = u_1^\top \left( bI + x_1 x_1^\top \right) u_1 - 2 y_1 u_1^\top x_1 + y_1^2$,
and indeed $D_1 = bI + x_1 x_1^\top$, $e_1 = y_1 x_1$, and $f_1 = y_1^2$.
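We proceed by induction. Assume that $P_{t-1}(u_{t-1}) = u_{t-1}^\top D_{t-1} u_{t-1} - 2 u_{t-1}^\top e_{t-1} + f_{t-1}$. Applying Lem. 1
we get,

$$
\begin{aligned}
P_t(u_t) &= \min_{u_{t-1}} \Big( u_{t-1}^\top D_{t-1} u_{t-1} - 2 u_{t-1}^\top e_{t-1} + f_{t-1} + c\|u_t - u_{t-1}\|^2 + (y_t - u_t^\top x_t)^2 \Big) \\
&= \min_{u_{t-1}} \Big( u_{t-1}^\top (cI + D_{t-1}) u_{t-1} - 2 u_{t-1}^\top (c u_t + e_{t-1}) + f_{t-1} + c\|u_t\|^2 + (y_t - u_t^\top x_t)^2 \Big) \\
&= -(c u_t + e_{t-1})^\top (cI + D_{t-1})^{-1} (c u_t + e_{t-1}) + f_{t-1} + c\|u_t\|^2 + (y_t - u_t^\top x_t)^2 \\
&= u_t^\top \Big( cI + x_t x_t^\top - c^2 (cI + D_{t-1})^{-1} \Big) u_t - 2 u_t^\top \Big[ c (cI + D_{t-1})^{-1} e_{t-1} + y_t x_t \Big] \\
&\quad - e_{t-1}^\top (cI + D_{t-1})^{-1} e_{t-1} + f_{t-1} + y_t^2 .
\end{aligned}
$$

Using the Woodbury identity we continue to develop the last
equation,

$$
\begin{aligned}
&= u_t^\top \Big( cI + x_t x_t^\top - c^2 \Big[ c^{-1} I - c^{-2} \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} \Big] \Big) u_t - 2 u_t^\top \Big[ (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t x_t \Big] \\
&\quad - e_{t-1}^\top (cI + D_{t-1})^{-1} e_{t-1} + f_{t-1} + y_t^2 \\
&= u_t^\top \Big( \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} + x_t x_t^\top \Big) u_t - 2 u_t^\top \Big[ (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t x_t \Big] \\
&\quad - e_{t-1}^\top (cI + D_{t-1})^{-1} e_{t-1} + f_{t-1} + y_t^2 ,
\end{aligned}
$$

and indeed $D_t = \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} + x_t x_t^\top$,
$e_t = (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t x_t$, and $f_t = f_{t-1} - e_{t-1}^\top (cI + D_{t-1})^{-1} e_{t-1} + y_t^2$, as desired. □

B.3 Proof of Lem. 3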



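Proof. We first use the Woodbury equation to get the
following two identities

$$D_t^{-1} = \Big[ \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} + x_t x_t^\top \Big]^{-1} = D_{t-1}^{-1} + c^{-1} I - \frac{\big( D_{t-1}^{-1} + c^{-1} I \big) x_t x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big)}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t}$$

and

$$(I + c^{-1} D_{t-1})^{-1} = I - c^{-1} \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} .$$

Multiplying both identities with each other we get,

$$
\begin{aligned}
D_t^{-1} (I + c^{-1} D_{t-1})^{-1} &= \left[ D_{t-1}^{-1} + c^{-1} I - \frac{\big( D_{t-1}^{-1} + c^{-1} I \big) x_t x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big)}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \right] \Big[ I - c^{-1} \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} \Big] \\
&= D_{t-1}^{-1} - \frac{\big( D_{t-1}^{-1} + c^{-1} I \big) x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \qquad (24)
\end{aligned}
$$

and, similarly, we multiply the identities in the other order
and get,

$$(I + c^{-1} D_{t-1})^{-1} D_t^{-1} = D_{t-1}^{-1} - \frac{D_{t-1}^{-1} x_t x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big)}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \qquad (25)$$

Finally, from (24) we get,

$$
\begin{aligned}
&(I + c^{-1} D_{t-1})^{-1} D_t^{-1} x_t x_t^\top D_t^{-1} (I + c^{-1} D_{t-1})^{-1} - D_{t-1}^{-1} + (I + c^{-1} D_{t-1})^{-1} \Big[ D_t^{-1} (I + c^{-1} D_{t-1})^{-1} + c^{-1} I \Big] \\
&= (I + c^{-1} D_{t-1})^{-1} D_t^{-1} x_t x_t^\top D_t^{-1} (I + c^{-1} D_{t-1})^{-1} - D_{t-1}^{-1} \\
&\quad + \Big[ I - c^{-1} \big( D_{t-1}^{-1} + c^{-1} I \big)^{-1} \Big] \left[ D_{t-1}^{-1} + c^{-1} I - \frac{\big( D_{t-1}^{-1} + c^{-1} I \big) x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \right] .
\end{aligned}
$$

We develop the last equality and use (24) and (25) in the
second equality below,

$$
\begin{aligned}
&= (I + c^{-1} D_{t-1})^{-1} D_t^{-1} x_t x_t^\top D_t^{-1} (I + c^{-1} D_{t-1})^{-1} - D_{t-1}^{-1} + D_{t-1}^{-1} - \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \\
&= \left( D_{t-1}^{-1} - \frac{D_{t-1}^{-1} x_t x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big)}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \right) x_t x_t^\top \left( D_{t-1}^{-1} - \frac{\big( D_{t-1}^{-1} + c^{-1} I \big) x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \right) - \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \\
&= \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{\Big( 1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t \Big)^2} - \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t} \\
&= - \frac{x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t \; D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{\Big( 1 + x_t^\top \big( D_{t-1}^{-1} + c^{-1} I \big) x_t \Big)^2} \preceq 0 .
\end{aligned}
$$

□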
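
The matrix identity established in this proof can be checked numerically. The following is a minimal sketch under stated assumptions (d = 2, an arbitrary positive definite D_{t-1}, and hypothetical helper names): it builds D_t from the recursion D_t = (D_{t-1}^{-1} + c^{-1} I)^{-1} + x_t x_t^T, evaluates the left-hand side of the lemma directly, and compares it with the closed form -s D_{t-1}^{-1} x_t x_t^T D_{t-1}^{-1} / (1 + s)^2, where s = x_t^T (D_{t-1}^{-1} + c^{-1} I) x_t.

```python
I2 = [[1.0, 0.0], [0.0, 1.0]]

def mm(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(A, B, s=1.0):
    """A + s * B for 2x2 matrices."""
    return [[A[i][j] + s * B[i][j] for j in range(2)] for i in range(2)]

def inv(A):
    """Inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def outer(x):
    """Rank-one matrix x x^T."""
    return [[x[i] * x[j] for j in range(2)] for i in range(2)]

def lemma3_sides(D_prev, x, c):
    """Return (left-hand side, closed-form right-hand side) as 2x2 matrices."""
    Dp_inv = inv(D_prev)
    A = add(Dp_inv, I2, 1.0 / c)           # D_{t-1}^{-1} + c^{-1} I
    D_t = add(inv(A), outer(x))            # recursion for D_t
    Dt_inv = inv(D_t)
    G = inv(add(I2, D_prev, 1.0 / c))      # (I + c^{-1} D_{t-1})^{-1}
    xxT = outer(x)
    term1 = mm(mm(mm(mm(G, Dt_inv), xxT), Dt_inv), G)
    term3 = mm(G, add(mm(Dt_inv, G), I2, 1.0 / c))
    lhs = add(add(term1, Dp_inv, -1.0), term3)
    s = sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))
    base = mm(mm(Dp_inv, xxT), Dp_inv)
    rhs = [[-s / (1.0 + s) ** 2 * v for v in row] for row in base]
    return lhs, rhs
```

Since the closed form is a non-positive multiple of a rank-one positive semidefinite matrix, agreement of the two sides also confirms that the left-hand side is negative semidefinite for the chosen inputs.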
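B.4 Derivations for Thm. 4

$$
\begin{aligned}
&(y_t - \hat{y}_t)^2 + \min_{u_1, \ldots, u_{t-1}} Q_{t-1}(u_1, \ldots, u_{t-1}) - \min_{u_1, \ldots, u_t} Q_t(u_1, \ldots, u_t) \\
&= (y_t - \hat{y}_t)^2 - e_{t-1}^\top D_{t-1}^{-1} e_{t-1} + f_{t-1} + e_t^\top D_t^{-1} e_t - f_t \\
&= (y_t - \hat{y}_t)^2 - e_{t-1}^\top D_{t-1}^{-1} e_{t-1} + e_{t-1}^\top (cI + D_{t-1})^{-1} e_{t-1} - y_t^2 \\
&\quad + \Big( (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t x_t \Big)^\top D_t^{-1} \Big( (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t x_t \Big) ,
\end{aligned}
$$

where the last equality follows from (8). We proceed to develop
the last equality,

$$
\begin{aligned}
&= (y_t - \hat{y}_t)^2 - e_{t-1}^\top D_{t-1}^{-1} e_{t-1} + e_{t-1}^\top (cI + D_{t-1})^{-1} e_{t-1} - y_t^2 \\
&\quad + e_{t-1}^\top (I + c^{-1} D_{t-1})^{-1} D_t^{-1} (I + c^{-1} D_{t-1})^{-1} e_{t-1} \\
&\quad + 2 y_t x_t^\top D_t^{-1} (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t^2 x_t^\top D_t^{-1} x_t \\
&= (y_t - \hat{y}_t)^2 + e_{t-1}^\top \Big( - D_{t-1}^{-1} + (I + c^{-1} D_{t-1})^{-1} \Big[ D_t^{-1} (I + c^{-1} D_{t-1})^{-1} + c^{-1} I \Big] \Big) e_{t-1} \\
&\quad + 2 y_t x_t^\top D_t^{-1} (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t^2 x_t^\top D_t^{-1} x_t - y_t^2 .
\end{aligned}
$$

B.5 Details for the bound (22)

To show the bound (22), note that

$$V \ge T \frac{Y^2 d M}{\mu^2} \;\Leftrightarrow\; \mu \ge \sqrt{\frac{T Y^2 d M}{V}} = c .$$

We thus have that the right term of (19) is upper bounded as follows,

$$
\begin{aligned}
\max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\; b + X^2 \right\}
&\le \max\left\{ 3X^2,\; \sqrt{X^4 + 4X^2 c},\; b + X^2 \right\} \\
&\le \max\left\{ 3X^2,\; \sqrt{2} X^2,\; \sqrt{8X^2 c},\; b + X^2 \right\} \\
&= \sqrt{8X^2} \max\left\{ \frac{3X^2}{\sqrt{8X^2}},\; \sqrt{c},\; \frac{b + X^2}{\sqrt{8X^2}} \right\} \\
&= \sqrt{8X^2} \sqrt{ \max\left\{ \frac{(3X^2)^2}{8X^2},\; c,\; \frac{(b + X^2)^2}{8X^2} \right\} } \\
&= \sqrt{8X^2} \sqrt{\max\{\mu, c\}} \le \sqrt{8X^2} \sqrt{\mu} = M .
\end{aligned}
$$

Using this bound and plugging $c = \sqrt{Y^2 d M T / V}$
we bound (19),

$$\sqrt{\frac{Y^2 d M T}{V}}\, V + \frac{1}{\sqrt{Y^2 d M T / V}}\, T d Y^2 M = 2 \sqrt{Y^2 d M T V} .$$

B.6 Proof of Lem. 6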
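Proof. For the first property of the lemma we have that
$f(\lambda) = \lambda\beta/(\lambda + \beta) + x^2 \le \beta \times 1 + x^2$. The second
property follows from the symmetry between $\beta$ and $\lambda$.
To prove the third property we decompose the function as
$f(\lambda) = \lambda - \frac{\lambda^2}{\lambda + \beta} + x^2$. Therefore, the function is bounded
by its argument, $f(\lambda) \le \lambda$, if, and only if, $-\frac{\lambda^2}{\lambda + \beta} + x^2 \le 0$.
Since we assume $x^2 \le \gamma^2$, the last inequality holds if
$-\lambda^2 + \gamma^2 \lambda + \gamma^2 \beta \le 0$, which holds for $\lambda \ge \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2}$.

To conclude: if $\lambda \ge \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2}$ then $f(\lambda) \le \lambda$. Otherwise,
by the second property, we have

$$f(\lambda) \le \lambda + \gamma^2 \le \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2} + \gamma^2 = \frac{3\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2} ,$$

as required. □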
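
The three properties used in this proof can be spot-checked on a grid of values. The sketch below is an illustration only, with hypothetical function names and arbitrarily chosen grid points; it assumes lambda, beta > 0 and 0 <= x^2 <= gamma^2, and tests f(lambda) <= beta + x^2, f(lambda) <= lambda + x^2, and f(lambda) <= max{lambda, (3 gamma^2 + sqrt(gamma^4 + 4 gamma^2 beta))/2}.

```python
import math

def f(lam, beta, x2):
    """f(lambda) = lambda * beta / (lambda + beta) + x^2, for lambda, beta > 0."""
    return lam * beta / (lam + beta) + x2

def third_property_bound(lam, beta, gamma2):
    """max{lambda, (3 gamma^2 + sqrt(gamma^4 + 4 gamma^2 beta)) / 2}."""
    return max(lam, (3 * gamma2 + math.sqrt(gamma2**2 + 4 * gamma2 * beta)) / 2)

def check_properties(lams, betas, gamma2, x2s, tol=1e-12):
    """Return True if all three properties hold on the given grid."""
    for lam in lams:
        for beta in betas:
            for x2 in x2s:
                if x2 > gamma2:
                    continue  # the lemma assumes x^2 <= gamma^2
                v = f(lam, beta, x2)
                if v > beta + x2 + tol:        # first property
                    return False
                if v > lam + x2 + tol:         # second property
                    return False
                if v > third_property_bound(lam, beta, gamma2) + tol:
                    return False               # third property
    return True
```

With beta = c and gamma = X, the third bound is the eigenvalue bound (3X^2 + sqrt(X^4 + 4 X^2 c))/2 that appears in the proof of Corollary 8.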



